In this project, the data set we have selected for analysis is the No-show appointments data set. This data set is a collection of the medical appointments from Brazil in the year 2016. Our main focus is to determine whether or not patients show up for appointments and find out what factors are important to know in order to predict if a patient will show up for appointment or not. The dataset consists of $14$ columns which tells us about the characteristics of the patients. Some of which include the following
The following are the questions of consideration in our exploration.
We begin by first importing all the necessary packages that we will require for our analysis. Upon loading our data set, we will begin by first assessing and buliding initution about the data set. To do this, we will check the shape of our data set, the data types of the columns and a concise summary of the dataframe using df.info() as well as the descriptive statistics for each column of the dataframe. After assessing the data, we will then proceed to cleaning our data by checking for missing values, duplicates and also making a few changes on the column names to maintain consistency. We will then explore the data with visuals to find patterns by creating histograms, bar charts and pie charts. Finally, we will then draw conclusions based on descriptive statistics and visualizations.
# imorting all the necessary packages
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import calendar
%matplotlib inline
# Loading no_show appointments dataset
df=pd.read_csv('noshowappointments-kagglev2-may-2016.csv')
# checking the first 5 rows of the dataset
df.head()
| PatientId | AppointmentID | Gender | ScheduledDay | AppointmentDay | Age | Neighbourhood | Scholarship | Hipertension | Diabetes | Alcoholism | Handcap | SMS_received | No-show | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2.987250e+13 | 5642903 | F | 2016-04-29T18:38:08Z | 2016-04-29T00:00:00Z | 62 | JARDIM DA PENHA | 0 | 1 | 0 | 0 | 0 | 0 | No |
| 1 | 5.589978e+14 | 5642503 | M | 2016-04-29T16:08:27Z | 2016-04-29T00:00:00Z | 56 | JARDIM DA PENHA | 0 | 0 | 0 | 0 | 0 | 0 | No |
| 2 | 4.262962e+12 | 5642549 | F | 2016-04-29T16:19:04Z | 2016-04-29T00:00:00Z | 62 | MATA DA PRAIA | 0 | 0 | 0 | 0 | 0 | 0 | No |
| 3 | 8.679512e+11 | 5642828 | F | 2016-04-29T17:29:31Z | 2016-04-29T00:00:00Z | 8 | PONTAL DE CAMBURI | 0 | 0 | 0 | 0 | 0 | 0 | No |
| 4 | 8.841186e+12 | 5642494 | F | 2016-04-29T16:07:23Z | 2016-04-29T00:00:00Z | 56 | JARDIM DA PENHA | 0 | 1 | 1 | 0 | 0 | 0 | No |
# checking the shape of our data set.
df.shape
(110527, 14)
The sample size for our data set is $ 110,527$ and it consists of $14$ columns.
# checking the data types of the dataset.
df.dtypes
PatientId float64 AppointmentID int64 Gender object ScheduledDay object AppointmentDay object Age int64 Neighbourhood object Scholarship int64 Hipertension int64 Diabetes int64 Alcoholism int64 Handcap int64 SMS_received int64 No-show object dtype: object
# checking the number of unique values in each column.
df.nunique()
PatientId 62299 AppointmentID 110527 Gender 2 ScheduledDay 103549 AppointmentDay 27 Age 104 Neighbourhood 81 Scholarship 2 Hipertension 2 Diabetes 2 Alcoholism 2 Handcap 5 SMS_received 2 No-show 2 dtype: int64
# Checking some general information about our dataset.
df.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 110527 entries, 0 to 110526 Data columns (total 14 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 PatientId 110527 non-null float64 1 AppointmentID 110527 non-null int64 2 Gender 110527 non-null object 3 ScheduledDay 110527 non-null object 4 AppointmentDay 110527 non-null object 5 Age 110527 non-null int64 6 Neighbourhood 110527 non-null object 7 Scholarship 110527 non-null int64 8 Hipertension 110527 non-null int64 9 Diabetes 110527 non-null int64 10 Alcoholism 110527 non-null int64 11 Handcap 110527 non-null int64 12 SMS_received 110527 non-null int64 13 No-show 110527 non-null object dtypes: float64(1), int64(8), object(5) memory usage: 11.8+ MB
From the above information, we see that there are no null values present in our data set.
# checking the distribution of the different characteristic features of our data set
df.hist(figsize=(10,10));
# checking if the dataset consists of duplicates.
sum(df.duplicated())
0
From the above results from the code, we see that the dataset contains no duplicate values. We now check if at all there are any missing values.
# checking for null values
df.isnull().sum()
PatientId 0 AppointmentID 0 Gender 0 ScheduledDay 0 AppointmentDay 0 Age 0 Neighbourhood 0 Scholarship 0 Hipertension 0 Diabetes 0 Alcoholism 0 Handcap 0 SMS_received 0 No-show 0 dtype: int64
The above result confirms that there are no missing or null values in this dataset
# checking the summary statistics of our dataset
df.describe()
| PatientId | AppointmentID | Age | Scholarship | Hipertension | Diabetes | Alcoholism | Handcap | SMS_received | |
|---|---|---|---|---|---|---|---|---|---|
| count | 1.105270e+05 | 1.105270e+05 | 110527.000000 | 110527.000000 | 110527.000000 | 110527.000000 | 110527.000000 | 110527.000000 | 110527.000000 |
| mean | 1.474963e+14 | 5.675305e+06 | 37.088874 | 0.098266 | 0.197246 | 0.071865 | 0.030400 | 0.022248 | 0.321026 |
| std | 2.560949e+14 | 7.129575e+04 | 23.110205 | 0.297675 | 0.397921 | 0.258265 | 0.171686 | 0.161543 | 0.466873 |
| min | 3.921784e+04 | 5.030230e+06 | -1.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 25% | 4.172614e+12 | 5.640286e+06 | 18.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 50% | 3.173184e+13 | 5.680573e+06 | 37.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 75% | 9.439172e+13 | 5.725524e+06 | 55.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 |
| max | 9.999816e+14 | 5.790484e+06 | 115.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 4.000000 | 1.000000 |
From the summary statistics in the above code, we observe that the minimum value for age is negative. This could be as a result of a mismatch which might have occured during data collection. We can thus remove all rows which have age as a negative value. We will also drop the rows with age $0$ as it is not logical to have age as zero.
# dropping the rows which contains age as a negative value as well as 0 value.
df.drop(df.query('Age <= 0 ').index,inplace=True)
# checking if the changes made to age
df.Age.describe()
count 106987.000000 mean 38.316085 std 22.466214 min 1.000000 25% 19.000000 50% 38.000000 75% 56.000000 max 115.000000 Name: Age, dtype: float64
We now have dropped the negative value for age as well as all rows containing age as $0$. We now see that the minimum age is no longer negative.
We now can also drop the columns which will not be of much help in our analysis.
# checking the unique values for Handcap
df['Handcap'].unique()
array([0, 1, 2, 3, 4], dtype=int64)
The Handcap column consists of numerical values 1, 2, 3 and 4 as it is difficult to encode what these numbers means or signify, we shall drop this column.
# drop columns that are extraneous from the dataset
df.drop(['PatientId', 'AppointmentID', 'Handcap'], axis=1, inplace=True)
# confirm changes
df.head(1)
| Gender | ScheduledDay | AppointmentDay | Age | Neighbourhood | Scholarship | Hipertension | Diabetes | Alcoholism | SMS_received | No-show | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | F | 2016-04-29T18:38:08Z | 2016-04-29T00:00:00Z | 62 | JARDIM DA PENHA | 0 | 1 | 0 | 0 | 0 | No |
To maintain consistency in the column names, we will make a few changes to make our column names more tidy as well as correct an error in the spelling for Hipertension to Hypertension. The following are the changes which we are going to make.
# Renaming the columns
df.rename(columns={'ScheduledDay':'Scheduled_day', 'AppointmentDay':'Appointment_day','Hipertension':'Hypertension','No-show':'No_show'},inplace=True)
#checking/ confirming changes in the column names
df.head(1)
| Gender | Scheduled_day | Appointment_day | Age | Neighbourhood | Scholarship | Hypertension | Diabetes | Alcoholism | SMS_received | No_show | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | F | 2016-04-29T18:38:08Z | 2016-04-29T00:00:00Z | 62 | JARDIM DA PENHA | 0 | 1 | 0 | 0 | 0 | No |
We now properly encode the data in the following columns: Scholarship, Hypertension, Diabetes, Alcoholism, SMS_received and No_show. We do this so that we can make it more easier to analyse and interpret the meaning of the information in these columns. The following are the changes which we are going to make.
# making the changes to the column values with 0 and 1 to No and yes respectively
df[['Scholarship','Hypertension','Diabetes','Alcoholism','SMS_received']] = df[['Scholarship','Hypertension','Diabetes','Alcoholism','SMS_received']].replace([0,1],['No','Yes'])
# changing the entries in the No_show column
df['No_show'] = df['No_show'].replace(['No','Yes'],['Showed_up','No_show_up'])
# confirming/ checking the changes made
df.head()
| Gender | Scheduled_day | Appointment_day | Age | Neighbourhood | Scholarship | Hypertension | Diabetes | Alcoholism | SMS_received | No_show | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | F | 2016-04-29T18:38:08Z | 2016-04-29T00:00:00Z | 62 | JARDIM DA PENHA | No | Yes | No | No | No | Showed_up |
| 1 | M | 2016-04-29T16:08:27Z | 2016-04-29T00:00:00Z | 56 | JARDIM DA PENHA | No | No | No | No | No | Showed_up |
| 2 | F | 2016-04-29T16:19:04Z | 2016-04-29T00:00:00Z | 62 | MATA DA PRAIA | No | No | No | No | No | Showed_up |
| 3 | F | 2016-04-29T17:29:31Z | 2016-04-29T00:00:00Z | 8 | PONTAL DE CAMBURI | No | No | No | No | No | Showed_up |
| 4 | F | 2016-04-29T16:07:23Z | 2016-04-29T00:00:00Z | 56 | JARDIM DA PENHA | No | Yes | Yes | No | No | Showed_up |
# We change the date format for Scheduled_day and Appointment_day to date_time
df['Scheduled_day'] = pd.to_datetime(df['Scheduled_day'])
df['Scheduled_day'] = df['Scheduled_day'].dt.date
df['Appointment_day'] = pd.to_datetime(df['Appointment_day'])
df['Appointment_day'] = df['Appointment_day'].dt.date
# confirm changes
df.head()
| Gender | Scheduled_day | Appointment_day | Age | Neighbourhood | Scholarship | Hypertension | Diabetes | Alcoholism | SMS_received | No_show | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | F | 2016-04-29 | 2016-04-29 | 62 | JARDIM DA PENHA | No | Yes | No | No | No | Showed_up |
| 1 | M | 2016-04-29 | 2016-04-29 | 56 | JARDIM DA PENHA | No | No | No | No | No | Showed_up |
| 2 | F | 2016-04-29 | 2016-04-29 | 62 | MATA DA PRAIA | No | No | No | No | No | Showed_up |
| 3 | F | 2016-04-29 | 2016-04-29 | 8 | PONTAL DE CAMBURI | No | No | No | No | No | Showed_up |
| 4 | F | 2016-04-29 | 2016-04-29 | 56 | JARDIM DA PENHA | No | Yes | Yes | No | No | Showed_up |
# checking the shape of the cleaned dataset
df.shape
(106987, 11)
# checking the number of female and male patients
df['Gender'].value_counts()
F 70118 M 36869 Name: Gender, dtype: int64
We now create a new data set consisting of patients who don't show up for appointments.
# get the patients who show up for appointments and those who did not in two separate data frames
df_show=df.query('No_show=="Showed_up"')
df_noshow=df.query('No_show=="No_show_up"')
# checking the total number of females in the entire dataset
f_total=df.query('Gender =="F"').count()['Age']
# checking the total number of males in the entire dataset
m_total=df.query('Gender =="M"').count()['Age']
# we define a function to group any feature by no_show
def stat(y):
fea_stat=df_noshow.groupby([y],as_index=False)["No_show"].count()
return pd.DataFrame(fea_stat)
# grouping the no show data frame by Gender and No_show using the function defined above.
stat('Gender')
| Gender | No_show | |
|---|---|---|
| 0 | F | 14275 |
| 1 | M | 7405 |
# we find the proportions of females and males who do not show up for appointments
fprop_noshow=(14275/f_total)*100 # proportion of females who do not show up for appointments
mprop_noshow=(7405/m_total)*100 #proportion of males who do not show up for appointments
print('The proportion of female patients is',fprop_noshow, "and",mprop_noshow,"for male patients who do not show up.")
The proportion of female patients is 20.358538463732565 and 20.084623938810385 for male patients who do not show up.
From the calculation of the proportions above, we can observe that females are more likely to miss appointments than males with a very small margin.
# plotting the bar chart for Gender and No_show appointments.
locations=[1,2]
labels=['Female','Male']
heights=[fprop_noshow,mprop_noshow]
plt.bar(locations,heights,tick_label=labels, color=['red','purple'])
plt.title('Gender vs No_show',fontsize=14)
plt.xlabel("Gender",fontsize=14)
plt.ylabel("Percentage of No_show", fontsize=14);
stat('Gender')
| Gender | No_show | |
|---|---|---|
| 0 | F | 14275 |
| 1 | M | 7405 |
From the bar chart above we can conclude that females are more likely to miss appointments than males though by a very small margin.
# checking the correlation between gender and No_show.
df['No_show']=df['No_show'].astype('category').cat.codes
df['Gender']=df['Gender'].astype('category').cat.codes
df.corr()
| Gender | Age | No_show | |
|---|---|---|---|
| Gender | 1.000000 | -0.094108 | 0.003238 |
| Age | -0.094108 | 1.000000 | 0.067025 |
| No_show | 0.003238 | 0.067025 | 1.000000 |
We also see that there is a very weak positive correlation between the gender of the patients and No_show
# Checking the total number of patients with Scholarships.
total_01= df.query('Scholarship == "Yes"').count()['Age']
# Checking the total number of patients with no Scholarships.
total_02=df.query('Scholarship == "No"').count()['Age']
# we group the no show dataframe by scholarship and No_show using the function we defined earlier.
stat('Scholarship')
| Scholarship | No_show | |
|---|---|---|
| 0 | No | 19109 |
| 1 | Yes | 2571 |
# finding the proportions of patients who did not show up for appointment and whether or not they had scholarships.
with_sch=(2571/total_01)*100
withno_sch=( 19109/total_02)*100
print('The proportion of patients with scholarships is',with_sch, "and",withno_sch,"with no scholarships")
The proportion of patients with scholarships is 23.7857341104635 and 19.8683690656907 with no scholarships
# plotting the bar chart for proportion of patients who did not show up for appointment and whether
# they had scholarships or not.
locations = [1,2]
label=['Scholarship','No_scholarship']
heights=[with_sch,withno_sch]
plt.bar(locations, heights,tick_label=label,color=['pink','grey'])
plt.xlabel('Scholarship',fontsize=14)
plt.ylabel('percentage of No_show',fontsize=14)
plt.title('Scholarship vs No_show',fontsize=14)
stat('Scholarship')
| Scholarship | No_show | |
|---|---|---|
| 0 | No | 19109 |
| 1 | Yes | 2571 |
From the bar chart above, we observe that the proportion of patients who have no scholarships are less likely to miss appointments compared to those who have scholarships.
# Plotting the historam of the age distibution of no show appointments.
df_noshow['Age'].plot(kind='hist',figsize=(5,5),color= 'green');
plt.title('Age distribution of No-show Appointments',fontsize=14)
plt.xlabel('Patients Age',fontsize=14);
plt.ylabel('Count of Appointments',fontsize=14);
pd.DataFrame(df.Age.describe())
| Age | |
|---|---|
| count | 106987.000000 |
| mean | 38.316085 |
| std | 22.466214 |
| min | 1.000000 |
| 25% | 19.000000 |
| 50% | 38.000000 |
| 75% | 56.000000 |
| max | 115.000000 |
From the histogram above, we see that the distribution of age is very much skewed to the right. This means that the majority of no show appointments are common in children and adults. Most of the elderly patients are more likely to show up for appointments.
# checking the total number of patients who received an Sms.
total_001=df.query('SMS_received == "Yes"').count()['Age']
# checking the total number of patients who did not received an Sms.
total_002=df.query('SMS_received == "No"').count()['Age']
# grouping the No_show data frame by SMS_received and whether they showed up for an appointment or not
# using the function stat.
stat('SMS_received')
| SMS_received | No_show | |
|---|---|---|
| 0 | No | 12112 |
| 1 | Yes | 9568 |
def chart(x):
"""This function groups a certain feature (x) by no-show and plots a pie chart representing the % of no-show
appointments in relation to the given feature.
"""
fea_stat=df_noshow.groupby([x],as_index=False)["No_show"].count()
chart_plot=px.pie(fea_stat,names=x,values='No_show',color_discrete_sequence=px.colors.sequential.RdBu,width=700,height=600 )
return chart_plot
# plotting the pie chart to show the percentage of no show appointments in relation to Sms received using
# the function defined above
print('Proportion of no-show appointments in relation to SMS_received.')
chart('SMS_received')
Proportion of no-show appointments in relation to SMS_received.
From the pie chart above, we see that 44.1% of the patients who did not show up for appointment received Sms notifications and 55.9% of patients who did not receive the notification did not show up.
We now find the proportions of patients that did not show up for appointment and whether or not they received an Sms to remind them about the appointment.
# calulating the proportion of patients who received sms and those who did not
no_sms=(12112/total_002)*100 #Proportion of those who did not show up for appointment and did not receive an sms
sms=(9568/total_001)*100 #Proportion of those who did not show up for appointment but received an sms
print('The proportion of patients who did not show up for appointment but received an sms is',sms, "and",no_sms, " for those who did not receive an sms")
The proportion of patients who did not show up for appointment but received an sms is 27.665172762758424 and 16.728819645866135 for those who did not receive an sms
# plotting the bar chart of the proportion of patients who received sms and those who did not
locations = [1,2]
label=['Received_sms','No_sms']
heights=[sms, no_sms]
plt.bar(locations, heights,tick_label=label,color=['blue','yellow'])
plt.xlabel('Sms', fontsize=14)
plt.ylabel('percentage of No_show', fontsize=14)
plt.title('Relationship between Sms and No_show', fontsize=14);
df_noshow.groupby(['SMS_received'])[['No_show']].count()
| No_show | |
|---|---|
| SMS_received | |
| No | 12112 |
| Yes | 9568 |
From the bar chart above, we see that the majority of patients that received Sms did not show up for appointments
In order to answer the above research question, we will make use of the pie chart. We are going to investigate the top 5 neighbourhoods that do not show up for appointments.
# In the data from for No_show, we check the value counts of neighbourhoods that dont show up for appointments
# then plot the pie .
df_noshow['Neighbourhood'].value_counts()[:5].plot(kind='pie',figsize=(10,10),autopct='%0.1F%%');
plt.title('Neighbourhoods with least number of appointments',fontsize=20);
pd.DataFrame(df_noshow['Neighbourhood'].value_counts()[:5])
| Neighbourhood | |
|---|---|
| JARDIM CAMBURI | 1432 |
| MARIA ORTIZ | 1194 |
| ITARARÉ | 898 |
| RESISTÊNCIA | 875 |
| CENTRO | 692 |
These are among the top five Neighbourhoods that don't show up for appointments
# Checking the count of patients with Hypertension who did not show up for appointments
stat('Hypertension')
| Hypertension | No_show | |
|---|---|---|
| 0 | No | 17908 |
| 1 | Yes | 3772 |
# plotting the pie chart to show the percentage of no show appointments in relation to Hypertension using
# the function chart defined earlier
print('Proportion of no-show appointments in relation to Hypertension.')
chart('Hypertension')
Proportion of no-show appointments in relation to Hypertension.
From the pie chart above, we see that 17.4% of the patients who did not show up for appointment had hypertension and 82.6% of patients who had no hypertension did not show up. The majority of patients who did not show up for appointments had no Hypertension.
# Checking the summary statistics of patients with Diabetes who did not show up for appointments.
df_noshow.Diabetes.describe()
count 21680 unique 2 top No freq 20250 Name: Diabetes, dtype: object
# plotting the pie chart to show the percentage of no show appointments in relation to Diabetes using
# the function defined earlier
print('Proportion of no-show appointments in relation to Diabetes.')
chart('Diabetes')
Proportion of no-show appointments in relation to Diabetes.
From the pie chart above, we see that 6.6% of the patients who did not show up for appointment had Diabetes and 93.4% of patients who had no Diabetes did not show up. The majority of patients who did not show up for appointments had no Diabetes
# Checking the count of patients with Alcoholism who did not show up for appointments
df_noshow.Alcoholism.describe()
count 21680 unique 2 top No freq 21003 Name: Alcoholism, dtype: object
# Checking the count of patients with Alcoholism who did not show up for appointments
stat('Alcoholism')
| Alcoholism | No_show | |
|---|---|---|
| 0 | No | 21003 |
| 1 | Yes | 677 |
# plotting the pie chart to show the percentage of no show appointments in relation to Alcoholism using
# the function chart
print('Proportion of no-show appointments in relation to Alcoholism.')
chart('Alcoholism')
Proportion of no-show appointments in relation to Alcoholism.
From the pie chart above, we see that 3.12% of the patients who did not show up for appointment had Alcoholism and 96.9% of patients who had no Alcoholism did not show up. The majority of patients who did not show up for appointments had no Alcoholism problem.
We now group the No_show column by Hypertension, Alcoholism, and Diabetes, in order to check if there is any relationship between the No_show and the patients medical condition.
#plotting the grouped bar chart
df_noshow.groupby(['Hypertension','Alcoholism','Diabetes'])['No_show'].value_counts().unstack().plot(kind="barh",figsize=(10,10),color='violet');
plt.title ('Groped bar Chart for Patients medical conditions', fontsize=14)
plt.xlabel('Counts of No-show appointments', fontsize=14)
plt.ylabel( 'Patients with Hypertension,Alcoholism and Diabetes',fontsize=14);
From the above analysis, we observe that patients who have no Hypertension, Diabetes and who are not Alcoholic are more likely to miss appointments.
In this project, we investigated the No-show appointment dataset for Medical appointments in Brazil for the year 2016. Upon loading the data and checking for cleanliness, it was observed that the dataset contained no duplicates and no missing or null values. The only cleaning that was done was to drop some extraneous columns which were not relevant for our analysis such as patientID, Handcap and AppointmentID. One of the limitation for the Handcap column was that it contained numerical values 0, 1, 2,3 and 4 which were difficult to encode properly and as a result, this column had to be dropped. The other limitation was that the dataset contained a negative value for age and all those rows with age has $0$ were dropped. The limitation could be that there maybe there could be other better ways of dealing with such cases and also, the dataset set contained more object data types than numeric ones.
In the data cleaning process, we also made a few changes to the column names by putting underscores to ScheduledDay and AppointmentDay and changing No-show to No_show. We also made changes to the format for Scheduled_day and Appointment_day to date-time.
From the exploratory data analysis, we endeavored to answer the questions that was posed in order to determine whether or not patients show up for appointments and also find out what factors are important to know in order to predict if a patient will show up for appointment or not.
Upon undertaking the exploratory data analysis, it was observed that females are more likely to miss appointments than males though with a small margin. It was also observed that patients who have no scholarships are not likely to miss appointments as compared to those who have. We further observed that the majority of no show appointments were common in children and aldults as opposed to the eldery. In conclusion, we also observed that the majority of patients that recieved Sms did not show up for appointments.